library(tidyverse)
library(rmarkdown)
library(ggrepel)
states <- map_data("state")
state_names <- states %>% group_by(region) %>% summarize(long=mean(long), lat=mean(lat))
acc <- read.csv("https://raw.githubusercontent.com/xdaiISU/ds202materials/master/hwlabs/fars2017/accident.csv", stringsAsFactors = FALSE)
ppl <- read.csv("https://raw.githubusercontent.com/xdaiISU/ds202materials/master/hwlabs/fars2017/person.csv", stringsAsFactors = FALSE)
glc <- readxl::read_xlsx("FRPP_GLC.xlsx")
# Preprocess GLC
glc$STATE = as.numeric(glc$`State Code`)
glc_states <- glc %>% group_by(STATE) %>% summarize(state_name=first(`State Name`))
acc_with_state_names <- acc %>% left_join(glc_states, by=c('STATE'))
acc %>% group_by(DAY_WEEK) %>%
summarize(number_accidents=n()) %>%
mutate(weekday=c('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')) %>%
select(-DAY_WEEK) %>%
subset(select=c(2,1))
## # A tibble: 7 x 2
## weekday number_accidents
## <chr> <int>
## 1 Sun 5360
## 2 Mon 4374
## 3 Tue 4347
## 4 Wed 4314
## 5 Thu 4621
## 6 Fri 5358
## 7 Sat 5873
Using the data manual, we know that values of DAY_WEEK 1-7 corresponds to days, Sun-Sat, which we can use to get the weekday label for each integer representation. In the resulting table, we can see that on Fridays, Saturdays, and Sundays, there are generally a significantly greater number of fatal accidents compared to Monday, Tuesday, Wednesday, or Thursday.
The greatest number of fatal accidents occur on Saturdays.
fatal <- ppl %>% filter(INJ_SEV==4)
paged_table(fatal)
Here we see a few entries of the dataset created which contains only the people who suffered fatal injuries.
# First get the MAKE for each accident
dat <- acc_with_state_names %>% inner_join(ppl, by=c('ST_CASE', 'STATE')) %>% select(STATE, state_name, MAKE)
most_dangerous <- dat %>%
filter(!is.na(MAKE)) %>%
group_by(STATE, state_name, MAKE) %>%
summarize(number_accidents=n()) %>%
filter(number_accidents==max(number_accidents))
## `summarise()` has grouped output by 'STATE', 'state_name'. You can override using the `.groups` argument.
most_dangerous %>% paged_table
Here we can see the most dangerous make ID for each state ID, as well as the number of fatal accidents for each of those state/make combinations.
states$region = toupper(states$region)
avg_state_locations <- states %>% group_by(region) %>% summarize(avg_long=mean(long), avg_lat=mean(lat))
dangerous_with_location <- most_dangerous %>% mutate(region=state_name) %>% inner_join(avg_state_locations, by=c('region'))
ggplot(states, aes(x=long, y=lat)) + geom_polygon(aes(group=group)) +
geom_text_repel(data=dangerous_with_location, aes(x=avg_long, y=avg_lat, label=MAKE), color='green')
I define the most dangerous vehicle to be the MAKE code which is involved in the highest number of fatal accidents for each state. Here we can see the MAKE code for the most dangerous vehicle by that definition for each state.
acc_ppl <- acc %>% inner_join(ppl)
## Joining, by = c("STATE", "ST_CASE", "VE_FORMS", "COUNTY", "DAY", "MONTH", "HOUR", "MINUTE", "RUR_URB", "FUNC_SYS", "HARM_EV", "MAN_COLL", "SCH_BUS")
paged_table(acc_ppl)